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Abstract 

Bayesian models provide a framework for probabilistic modelling of complex 
datasets. However, many of such models are computationally demanding espe¬ 
cially in the presence of large datasets. On the other hand, in sensor network 
applications, statistical (Bayesian) parameter estimation usually needs distributed 
algorithms, in which both data and computation are distributed across the nodes 
of the network. In this paper we propose a general framework for distributed 
Bayesian learning using Bregman Alternating Direction Method of Multipliers 
(B-ADMM). We demonstrate the utility of our framework, with Mean Field Vari¬ 
ational Bayes (MFVB) as the primitive for distributed Matrix Factorization (MF) 
and distributed affine structure from motion (SfM). 


1 Introduction 

Traditional setting for many machine learning algorithms is the one where the model (e.g., a 
classifier or a regressor, typically parametric in some sense) is constructed from a body of data 
by processing this body in either batch or online fashion. The model itself is centralized and the 
algorithm has access to all model parameters and all data points. However, in many application 
scenarios today it is not reasonable to assume access to all data points because they could be 
distributed over a network of sensor or processing nodes. In those settings collecting and processing 
data in a centralized fashion is not always feasible because of several important challenges. First, 
in applications such as networks of cameras mounted on vehicles, the networks are constrained 
by severe capacity and energy constraints, considerably limiting the node communications [1,5]. 
Second, in many distributed sensor network applications such as health care, ecological monitoring, 
or smart homes, collecting all data at a single location may not be feasible because of its sheer 
volume as well as potential privacy concerns. Lastly, the size of the centralized data would incur 
an insurmountable computational burden on the algorithm, preventing real-time or anytime [24] 
processing often desired in large sensing systems [4]. 


Distributed sensor networks provide an application setting in which distributed optimization tasks 
(including machine learning) that deal with some of the aforementioned challenges are frequently 
addressed [1,17]. However, they are traditionally considered in a non-Bayesian (often deterministic) 
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fashion. Moreover, the data is often assumed to be complete (not missing) across the network and 
in individual nodes. As a consequence, these approaches usually obtain parameter point estimates 
by minimizing a loss function based on the complete data and dividing the computation into subset- 
specific optimization problems. A more challenging yet critical problem is to provide full posterior 
distributions for those parameters. Such posteriors have the major advantage of characterizing the 
uncertainty in parameter learning and predictions, absent from traditional distributed optimization 
approaches. Another drawback of such approaches is that they traditionally rely on batch processing 
within individual nodes, unable to seamlessly deal with streaming data frequently present in sensing 
networks. Moreover, the deterministic distributed optimization approaches are not inherently able to 
account for various forms of the missing data, including missing at random or not-at-random [15]. 
However, both the sequential inference and the data completion would be naturally handled via the 
Bayesian analysis [3] if one could obtain full posterior parameter estimates in this distributed setting. 


In a recent work [2] proposed a new method that estimates parametric probabilistic models with 
latent variables in a distributed network setting. The performance of this model was demonstrated 
to be on par with the centralized model, while it could efficiently deal with the distributed missing 
data. Nevertheless, the approach has several drawbacks. First, its use of the Maximum Likelihood 
(ML) estimation increases the risk of overfitting, which is particularly pronounced in the distributed 
setting where each node works with a subset of the full data. Second, the approach cannot provide 
the uncertainty around the estimated parameters that may be crucial in many applications, e.g., 
in online learning for streaming data or in assessing confidence of predictions. In this paper we 
propose a Distributed Mean Field Variational Inference (D-MFVI) algorithm for Bayesian Inference 
in a large class of graphical models. The goal of our framework is to learn a single consensus 
Bayesian model by doing local Bayesian inference and in-network information sharing without 
the need for centralized computation and/or centralized data gathering. In particular, we demon¬ 
strate D-MFVI on the Bayesian Principle Component Analysis (BPCA) problem and then apply 
this model to solve two distributed Bayesian matrix factorization tasks, a matrix completion for 
item rating/preference modeling and the distributed structure-from-motion task in a camera network. 


The reminder of this paper is as follows. In Section|2] we briefly review MFVI for graphical models 
and then derive distributed MFVB. In Section[2 we review Bayesian PCA and then derive D-MFVI 
in this setting. In Section|4] we report experimental results of our model using both synthetic and real 
data. Finally, we discuss our approach including its limitations and possible solutions in Section|5] 

2 Distributed Mean Field Variational Inference (D-MFVI) 

We first explain a general parametric Bayesian model in a centralized setting and then we derive its 
distributed form. 

2.1 Centralized Setting 

Consider a data set X of observed D-dimensional vectors X = {xi G with the correspond¬ 
ing local latent variables Z = {zi G a global latent variable IL G and a set of hxed 

parameters fl = Hiuj. The main assumption of our class of models is the factorization of the 
joint distribution of observations, global and local variables into a global term and a product of local 
terms (the graphical representation of this class of models is shown in Fig[T]i: 

N 

P{X,Z,W\n) = P{W\n^)\{P{Xn\Zn,W)P{Zn\n,) ( 1 ) 

i=l 

Given the observation, the goal is to compute the approximation of the posterior distribution of the 
latent variables: P(IL, Z\X, H). For ease of computation, we consider exponential family assump¬ 
tion of the conditional distribution of a latent variable given the observation and the other latent 
variables: 


P{W\X, Z, = h{W)exp{ri^{X, Z, T{W) - Ay, (?7„(X, Z, fly ,))} (2) 
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Figure 1: A graphical representation of the model of Eq. O (blue-shaded circle denotes observa¬ 
tion). 


N 

P{z\x, w, n,) = (x„, w, n,)^T(zn) - A, (X, w, n,))} o) 

n—1 

where h(.) denotes the base measure, .4(.) denotes the log partition function, and ri{.) and T(.) 
denote the natural parameter and the sufficient statistics respectively. The assumed class of models 
contains many well known statistical models such as Bayesian PCA and Bayesian Mixture of PCA 
[8], Latent Dirichlet Allocation [6], Bayesian Gaussian Mixture model [7], Hidden Markov model 
[9,10], etc. 

In many probabilistic models, due to the intractability of computing the exact posterior distribution 
of the latent variables given the observations, we have to use approximate inference algorithms 
among which we use MFVI, which roots our strategy for distributed inference. Using conjugate 
priors for W and Z corresponding to exponential family of the conditional distributions, MFVI 
always returns a distribution in the same exponential family of the prior distributions. 

2.2 Mean Field Variational Inference (MFVI) 

The goal of the Variational Inference is to approximate the true posterior distribution over the latent 
variables with a simpler distribution indexed by a set of free parameters that is closest in KL diver¬ 
gence to the true posterior distribution [16]. MFVI is a subclass of VI that uses a family where all 
latent variables are independent of each other. More precisely, MFVI considers the following family 
of distributions as the approximate posterior distribution. 

N 

Q{Z, IF) = n Qi^n; KJQ{W; Xw) (4) 

n—1 

where the form of Q{zn\ Az„) and Q{W', Aw) are set to be in the same exponential family as the 
conditional distributions P{W\X, Z, (Eq.|2]i and P{Z\X, W, 0^) (Eq.[3 and Xz = {Az^jiJLi 
and Aw denote the variational parameters that are determined by maximizing the following varia¬ 
tional objective function (that is equivalent to minimizing the KL{Q{Z, W)\\P{Z, 1F|X, Hz, Hm)) 

[23]. 

C{Xz, Aw) = Eg [logP(Ar, Z, lF|Uz, f^iu)] - Eg [log Q] 

N N 

= '^'^Q{z„,W)[^OgP{Xn\Zn,W)] -f ^ Eg(z„) [ log P(z„ jUz)] -f Eg(w) [ log P(VF|H„)] 

n—1 n—1 

N 

- -Eg(w)[log(9(lF)] (5) 

n—1 

It should be noted that all terms of LiXz, Aw) are functions of the posterior parameters Xz, Aw- 


3 



2.3 Distributed Setting 


Consider G = (y,i?) as an undirected connected graph with vertices i, j C V and edges Cy = 
{i,j) G E connecting the two vertices [2]. Each i-th node is directly connected with 1-hop neighbors 
in Bi = {j\eij G E} [2]. Now, assume that each i-th node has its own set of data points Xi = 
{xin\n = local parameters Zi = {zin\n = and global parameter Wi where 

Xin G is n-th data point and Ni is the number of samples collected in i-th node. Each i-th node 
infers the approximate posterior distribution over both global and local parameters locally, based 
on the available data in that node. Computing the posterior distribution of the local parameters 
({Zi}[^[) is not an issue in the distributed setting due to the fact that the posterior distribution of 
each local latent variable Zn depends solely on the corresponding observation Xn and is independent 
of other observations (2f_„). A naive approach for computing the global parameter {W) posterior 
distribution is to impose an additional constraint on the global parameter in each node (Wi = W 2 = 
... = W\v\)- However, in a Bayesian framework, the parameters are random variables and the notion 
of equality can be replaced with equivalency. This, however, leaves several options open (e.g., strict 
equality, equality in distribution, or almost sure equality). Here, we propose imposing equivalency 
in distribution, i.e., imposing equality constraints on the parameters of the posterior distribution of 
the global variable in each node (Awi = ^W 2 = Similar to [2], for decoupling 

purposes, we define a set of auxiliary variables pij , one for each edge e^. This now leads to the 
final distributed consensus MEVl formulation that it is easy to show it is equivalent to the centralized 
MFVI optimization problem: 

[Az,Aw]= argmin - Eq [logP(X, Z, H^,)] -f EQpogQ] 

s.t. Xwi — Pij ^ Pij — ArVj; t G V, j G Bi (6) 

Alternating Direction Method of Multipliers (ADMM) [17] could be used to efficiently solve the 
above constrained optimization problem. More precisely, ADMM ulternatively updates the variables 
in a block coordinate fashion by solving the augmented Lagrangian (using a linear and a quadratic 
penalty term) of (|6]i: 


|V"| Ni \V\ N 

[\z,\w,P,l]= arg min . -EE ,Wi) [ \\0g P{Zin\^z)\ 

P.7Az,,Aw,:* 6V i=l„=l 

|t^l |t^l Ni |V| 

EQ(zin) [log Q{Zin)\ + ^ ^Q{Wi) [log Q{Wi)\ 

i—1 i—1 n—1 

+ EE iJjli^Wi - Pij) + lJj2{Pij - ^Wj) Mee \\\wi - Pij II 2 + \\Pij - Hi 

ieVjGBi ^ ^ i&V jGBi ^ 

(7) 

Using conjugate exponential family for prior and likelihood distributions, each coordinate descent 
update in MEVI can be done in closed form. However, the penalty terms would be quadratic in the 
norm difference of [Xwi — Pij that may result in the non-analytic updates for {Xwi ll^i (it should 

be noted that updating {Az^}[^can still be done in closed form because they do not appear in the 
equality constraints). To solveefficiently, we propose to use Bregman ADMM (B-ADMM) [18] 
rather than standard ADMM. B-ADMM simply generalizes the ADMM by replacing the quadratic 
penalty term by different Bregman divergences in order to exploit the structure of problems (a brief 
review of the B-ADMM provided in the Appendix). Since global parameters are the parameters of 
the natural exponential family distributions, we propose to use the log partition function v4iu(.) of 
the global parameter as the bregman function. It is worth noting that the Aw{-) is not a stricly convex 
function in general, but, it is estrictly convex if the exponential family is minimal (an exponential 
family is minimal if the functions r]{.) and the statistics T(.) each are linearly independent. We can 
always achieve this by reparametrization. Hence, using minimal representation of the exponential 
families, it is easy to show that using this bregman function, coordinate descent steps [8] and |9] of 
B-ADMM for solving |6] have a closed-form solution for most of exponential families (e.g. Normal, 

Gamma, Beta, Bernouli, Dirichlet). Based on the proposed Bregman function, we get the following 
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updates for BADMM: 


|t^l Ni \V\ N 

= argmin - ^ ^[logP(xi„|zi„, W^)] - ^ [logP(z„|0^)] 

IV'I |V| Ni |y| 

-E®^Q(H^o[iogWif^^)]+EE ^Q{zi„)[^OgQ{Zin)] +J2^QiWi)[iogQm)] 


2=1 


i—l n—1 

+ EE pfj )+ ^{pfj -) + ^EE, pfj) (8) 

iev j&Bi ^ ^ i&VjeBi 

p(‘+i) = argmin ^ ^ “ Pv) + 7^2+ ^E E (Py : 


2=1 


i£V jGB: 


i€V j^Bi 


(9) 


-(t+i) 

7zjfe 


(i) 

'lijk ' 


?7(A 


(*+i) 

Wi 


.(‘+1) 


) A = 1,2, ieVjGBi 


( 10 ) 


where ^ijkA, j with A: = 1, 2 are the Lagrange multipliers. The scalar value rj is the parameter 
of the ADMM algorithm that should be determined in advanced (refer [17] for more information 
about how to tune this parameter). (.,.) denotes the bregman divergence induced by 7lu,(.) and 
is defined as: 

BA„{x,y) = Ain{x) - Aw{y) - {x - y,VAw{y)) 

where V denotes the gradient operator and a:^) denotes the value of the parameter x at iteration 
t. It should be noted that since Bregman divergences are not necessarily convex in the second 
argument, we cannot use the same Ba„ {^Wi , pfj) for the bregman penalization term in|9] hence, 
Wang and Banerjee [18] proposed use of the bregman divergence with reverese of the parameters 
{Ba„ {pijj they proved the convergence of the new update equations. 

The intuition behind the proposed bregman function is as follows: based on the fact that the bregman 
divergence (using log partition function as bregman function) between two parameters A, A' of 
the same (minimal) exponential family V{x) is equivalent to the reverse KL divergence between 
the exponential families (Eq. [TTJ [23] and assuming that pij is the natural parameter of the same 
exponential family as Q\^. (.), penalizing the deviation of the posterior parameter Aw, from the 
parameter pij using the bregman divergence Ba„ (Ah/, , Pij ) is equivalent to penalizing the deviation 
of the approximate posterior distribution Qaw i^i) from the distribution Qp.^ (.) in KL sense. 


Ba^ (A, A') = 7l^(A) - 7l„(A') - (A - A', V7l»(A')) = KLiVy^x), Vx{x)) (11) 


3 Case Study: Distributed Bayesian PCA (D-BPCA) 

In what follows, we derive D-MFVI in the context of Bayesian PCA [8]. Consider a data set X 
of observed D-dimensional vectors X = {xn S with the corresponding latent variables 

Z = {zn G whose prior distribution is a zero mean Gaussian B(zn) = JV{Zn',0,1)' 

In Probabilistic PCA model, the observed variable Xn is then defined as a linear transformation of 
Zn with additive Gaussian noise e: Xn = WZn + p + where PL is a Zl x M matrix, /r is a 
d-dimensional vector and e is a zero-mean Gaussian-distributed vector with precision matrix t~^I. 
So, the likelihood distribution is: 

P{Xn\Zn,W,p,T) = M{Xn]WZn +P,T~^I), n=l,...,N (12) 

Based on the above probabilistic formulation of PCA, we can obtain a Bayesian treatment of PCA 
by first introducing a prior distribution P{p,W,a^) over the parameters of the model . Second, 
we compute the corresponding posterior distribution P{p, W, t\X) by multiplying the prior by the 
likelihood function given by (ITSl) . and normalizing. 

There are two issues that must be addressed in this framework: (i) the choice of prior distribution, 
and (ii) the formulation of a tractable procedure for computing the posterior distribution. Typically, 
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in BPCA, the prior distributions over parameters are defined such that they are independent of each 
other apriori P{n, W, r) = P{^)P{W)p{t). For simplicity, in this paper we assume that the data 
noise precision r is a fixed but unknown parameter. We define an independent Gaussian prior over 
each row of W as 

D , 

P{W\a) = - Wd)^{wd - Wd)} (13) 

d^l ^ 

where Wd is the d-th row of W, {wd}d=i {ad}d=i the mean and the precision hyper 
parameters respectively. Furthermore, we consider another Gaussian distribution as prior for fi: 
P(/i) = where p and 0 are the mean and the precision hyper parameters respectively. 

Due to the intractability of computing the exact posterior distribution of the parameters, we use 
MFVI to approximate the posterior distribution. In order to apply MFVB to Bayesian PCA we 
assume a fully factorized Q distribution of the form 

DM N D 

Q{W,Z,p)^ nn Qip^dm) ^(^n) Q(^{^d') (1^) 

d—1 m—1 n—1 d—1 

Due to the use of conjugate priors for fF, Z and //, the posterior distributions are Gaussian 

(Qiwdm) Q{pd) ~ and Q(zn) ~ Gaussian 

and their update is equivalent to re-estimation of the corresponding means and variances. 


3.1 Distributed formulation 


The distributed MFVI algorithm developed in Section |2] can be directly applied to this BPCA 
model. Based on the terminology of section |2] W and /i are global latent variables, and 
{zn}n=i are the local latent variables. The basic idea is to assign each subset of samples 
to each node in the network, and do inference locally in each node. By considering = 
i^dm)i’ as the Set of parameters for node i. The D-MFVI 

optimization now becomes: 


^ =arg min 
Si-.iev 


\v\ 

~ Z [ log P{X„ Zi, ITi |r, a, 


6, p, wi,..., Wd)] + EQ(vi/._^._z.)[log Q{Wi, Pi,Zi)] 


s.t. (m^)i = 

s-t- {md^)i = {Pdm)ij, {Pdm)ij = 

s.t. (A^), = = (A^)^., tGV,jeB, 

s.t. (A:L). = = i^dm)D i & V, J G B, (15) 


where {{Pd)ij, {4’d)iP iPdm)iP (^dm)v} afo auxiliary variables (we have explained how to specify 
hyper-parameters T,a,0, jl,wi,..., Wd and presented the coordinate descent update rules for solving 
the above optimization problem in the Appendix). Generalizing our distributed BPCA (D-BPCA) 
to deal with missing data is straightforward and follows [15]. 


4 Experimental Results 

In this section, we first demonstrate the general convergence properties of the D-BPCA algorithm 
based on synthetic data. We then apply our model to a set of SfM and MF problems. For both 
applications, we compared our distributed algorithm with traditional SVD, Centralized PCA (C- 
PCA) [2], D-PPCA [2], and Centralized BPCA (C-BPCA) [15]. 

4.1 Empirical Convergence Analysis 

We generated synthetic data (using Gaussian distribution) to show the convergence of D-BPCA in 
various settings. Based on the results (detailed results are available in the Appendix), D-BPCA is 
robust to topology of the network, the number of nodes in the network, choice of the parameter rj, 
and both data missing-at-random (MAR) and missing-not-at-random (MNAR) [15] cases. 
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Table 1: Results of Caltech dataset (all results ran 20 independent initializations). 


Object 
# Points 
^ Prames 

Ball Sander 
62 

30 

BoxStuff 

67 

30 

Rooster 

189 

30 

Standing 

310 

30 

StorageBin 

102 

30 

Subspace angle b/w centralized SVD SfM and D-PPCA (degree) 

Mean 

Variance 

1.4934 

0.4171 

1.4402 

0.4499 

1.4698 

0.9511 

2.6305 

1.7241 

0.4513 

1.2101 

Subspace angle b/w centralized SVD SfM and D-BPCA (degree) 

Mean 

Variance 

0.9910 

0.0046 

0.9879 

0.0986 

1.3855 

0.0080 

0.9621 

0.0033 

0.4203 

0.0044 


4.2 D-BPCA for Structure from Motion (SfM) 

In affine SfM, based on a set of 2D points observed from multiple cameras (or views), the goal is 
to estimate the corresponding 3D location of those points, hence the 3D structure of the observed 
object as well as its motion (or, equivalently, the motion of the cameras used to view the object). A 
canonical way to solve this problem is factorization [14]. More precisely, by collecting all the 2D 
points into a measurement matrix X of size ^points x 2 • ^frames, we can factorize it into a 
^points X 3 3D structure matrix W and a 3 x 2 • ^frames motion matrix Z. SVD can be used 
to find both W and Z in a centralized setting. C-PPCA and D-PPCA can also estimate W and Z 
using the EM algorithm [2]. Equivalently, the estimates of W and Z can also be found using our 
D-BPCA where W is treated as the latent global structure and Z is the latent local camera motion 
(it is worth noting that D-PPCA can only provide the uncertainty around the motion matrix Z, while 
D-BPCA obtains additional estimates of the variance of the 3D structure W). We now show that our 
D-BPCA can be used as an effective framework for distributed affine SfM. Eor all SfM experiments, 
the network was connected using the ring topology, with p = 10. We equally partitioned the frames 
into 5 nodes to simulate 5 cameras, the convergence was set to 10“^ relative change in objective 
of (|6]l. We computed maximum subspace angle between the ground truth 3D coordinates and the 
estimated 3D structure matrix as the measure of performance (for C-BPCA and D-BPCA, we used 
the posterior mean of 3D structure matrix for subspace angle calculation). 

4.2.1 Synthetic Data (Cube) 

Similar to synthetic experiments in [2], we used a rotating unit cube and 5 cameras facing the cube 
in a 3D space to generate synthetic data. In contrast to the setting in [2], we rotated the cube every 
3° over 150° clockwise to obtain additional views necessary for our online learning evaluation i.e. 
in this setting, each camera observed 50 frames. Pig. |2] shows the performance of different models 
in the case of noisy data (over 20 independent runs with 10 different noise levels). 

As can be seen from the figure, in the case of noisy data, D-BPCA consistently outperforms D- 
PPCA, thanks in part to improved robustness to overfitting. 

Eor MAR experiment, where we randomly discarded 20% of data points over ten independent runs. 
The average errors were 1.41° and 1.01° for D-PPCA and D-BPCA, respectively. The same experi¬ 
ment was done for the more challenging MNAR with the missing data generated by a realistic visual 
occlusion process (hence, non-random). This yielded errors of 17.66° for D-PPCA and 14.12° for 
D-BPCA. Again, D-BPCA resulted in consistently lower errors than D-PPCA, although the error 
rates were higher in the more difficult MNAR setting. 

One particular advantage of the D-BPCA over D-PPCA and the SVD counterparts is its ability to 
naturally support online Bayesian estimation in the distributed sensing network. We first used 10 
frames in each camera as the first minibatch of data. Then, we repeatedly added 5 more frames 
to each camera in subsequent steps. Results over 10 different runs with 1% noise in the data are 
given in Pig.[3](due to the non-Bayesian nature of the D-PPCA model, it cannot easily be applied in 
the online setting). Pig |3] demonstrates that the subspace angle error of the online D-BPCA closely 
follows centralized BPCA in accuracy. 
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Figure 2: Noisy data experiment for the cube synthetic data (crosses denote outliers). 
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Figure 3: Online data experiment for the cube synthetic data (crosses denote outliers). 


4.2.2 Real Data 

We applied our model on the Caltech 3D Objects on Turntable dataset [19] and Hopkinsl55 dataset 
[20] to demonstrate its usefulness for real data. Following [2], we used a subset of the dataset which 
contains images of 5 objects for Caltech dataset, and 90 single-object sequences for Hopkinsl55 
dataset. For both datasets, we used the same setup as [2]. The subspace angles (mean and variance 
of 20 independent runs) between the structure inferred using the traditional centralized SVD and 
the D-PPCA and D-BPCA for Caltech dataset are available in Table [T] As can be seen, the D- 
BPCA performance is better than D-PPCA (the MAR/MNAR results for the Caltech dataset and 
are available in the Appendix). Average maximum subspace angle between D-PPCA, D-BPCA and 


8 
































Table 2: Subspace angles (degree) between fully observable centralized SVD and D-PPCA / D- 
BPCA for Hopkins dataset (all results ran 5 independent initializations). 




No-missing 

MAR 

D-PPCA 

Mean 

3.9523 

13.4753 


Variance 

3.3119 

12.9832 

D-BPCA 

Mean 

0.7975 

6.4372 


Variance 

0.5684 

5.0689 


Table 3; Root mean squared error for matrix factorization experiment on Netflix dataset. 


C-PPCA D-PPCA C-BPCA D-BPCA 
1.7723 1.9934 1.4465 1.5613 


SVD for all selected 90 objects without missing data and with 10% MAR are shown in Table |2] 
(we did not perform MNAR experiments on Hopkins due to the fact that the ground truth occlusion 
information is not provided with the dataset). Also, for this dataset, D-BPCA consistently has better 
performance than D-PPCA. It should be noted that although the subspace angle error is very large 
for MAR case for both D-PPCA and D-BPCA, 3D structure estimates were similar to that of SVD 
only with orthogonal ambiguity issue. 


4.3 D-BPCA for Matrix Factorization (MF) 

Matrix factorization (MF) is a canonical approach for recommender systems due to its promising 
performance. In this section we apply our D-BPCA for large-scale MF problem using the Netflix 
dataset [22] that contains movie ratings given by = 480189 customers to D — 17770 movies. 
The PCA model is one of the most popular techniques considered by the Netflix contestants [21]. 
We used given R = 100480507 ratings from 1 to 5, and predicted the R' = 1408395 ratings 
of the probing set. We used 10 nodes to divide the whole rating matrix into 10 sub-blocks for 
local computation, where each sub-block contains D rows and A^/10 columns. We set p = 10 
and the number of principle components to 12. The root mean squared error (RMSE) of different 
methods (for Bayesian models, we simply used the posterior means to do reconstruction) is shown 
in Table 13 from which we can see that by our model, local Bayesian computation effectively arrives 
at the solution of centralized approaches that handle all data in a single batch. In addition to lower 
RMSE the crucial benefit of the distributed method is that the data is distributed over different nodes, 
reducing storage requirements and enabling parallel computation, hence computational speedups. 


5 Discussion and Future Work 


In this paper we introduced a general approximate inference approach using Mean Eield Variational 
Inference for learning parameters of traditional centralized probabilistic models in a distributed set¬ 
ting. The main idea is to split the data into different nodes and impose the equality constraints 
on the posterior parameters of each node, and then, solving the constrained variational inference 
using Bregman ADMM. We illustrated this approach with BPCA for SfM and Matrix Eactoriza- 
tion applications and more importantly, this algorithm generalizes to many settings. We developed 
our framework with MEVI and closed form coordinate updates with conjugate exponential family 
models. Although this class of models is pervasive, the proposed framework cannot be used in non¬ 
conjugate models. A more principled approach would be to use our framework for more advanced 
variational inference algorithms such as collapsed variational inference [11,12] and structured vari¬ 
ational inference [13]. 
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Appendix 


A Bregman Alternative Direction Method of Multipliers 

ADMM has been successfully applied in a broad range of machine learning applications [17]. 
ADMM is canonically used for optimizing the following objective function subject to an equality 
constraint; 


argmin /(a;) + ^(z), s.t.Ax + Bz = c (16) 


where / and g are convex functions, A, B and c are some fixed terms. ADMM iteratively solve the 
the augmented Lagrangian of (ITfiT l. which is defined as follows: 

Lp{x, z, y) = f{x) + g{z) + {y, Ax + Bz - c) + ri/2\\Ax + Bz - c ||2 (17) 


where y is dual variable, p > 0 is penalty parameter, and the goal of quadratic penalty term is to 
penalize the violation of the equality constraint. ADMM iteratively solve the above optimization 
problem by spliting the variables using the following updates: 

xt+i = argmin f{x) + {yt,Ax + Bzt - c) + r]/2\\Ax + Bzt - c ||2 (18) 

X 

zt+i = argmin g{z) + {yt,Axt+i +Bz-c) + r]/2\\Axt+i + Bz - c\\l (19) 

Z 

yt+i =yt + viAxt+i + Bzt+i - c) (20) 


Bregman ADMM (BADMM) replaces the quadratic penalty term in ADMM by a Bregman diver¬ 
gence [18]. More precisely, the quadratic penalty term in the x and z updates (fTSt -dlOb will be 
replaced by a Bregman divergence in BADMM; 


xt+i = argmin f{x) + (j/t, Ax + Bzt - c) + r]B^{c - Ax, Bzt) (21) 

X 

zt+i = argmin g{z) + {yt, Axt+i + Bz - c) + r]B^{Bz, c - Axt+i) (22) 

Z 

yt+i =yt+ 'q{Axt+i + Bzt+I - c) (23) 

where : D x RI{n) M+ is the Bregman divergence with Bregman function (j> ((j)is a strictly 
convex function on D) that is defined as: 


y) = (l){x) - (j){y) - {V(j){y), x - y) (24) 


B Full Derivation of D-MFVI for BPCA 


Before describing the proposed model and deriving the ADMM equations, we briefly describe the 
notations: 
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{xn)i '■ a D X 1 vector which denotes the n-th data point in the i-th node 
{zn)i : a M X 1 vector which denotes the latent variable of the n-th data point in the i-th node 
{fi)i : a D X 1 vector which denotes the local mean in the i-th node 
{W)i : a D X M matrix which denotes the local weight matrix in the i-th node 
(t)^ : a scalar which denotes the local precision of the data noise in the i-th node 
{a)i : a scalar which denotes the local precision of the weight matrix noise in the i-th node 
{0)i : a scalar which denotes the local precision of the mean vector noise in the i-th node 
{m^)i : a M X 1 vector which denotes the mean of the n-th latent variable in the i-th node 
(A^)i : a M X M matrix which denotes the precision matrix of the n-th latent variable in the i-th node 
{m^)i : a scalar which denotes the mean parameter of the d-th element of the local mean(/r)iin the i-th node 
(A^)i : a scalar which denotes the variance of the d-th element of the local mean(p)iin the i-th node 
in the i-th node((A(^)idenotes the d-th element of this vector) 

^ scalar which denotes the mean parameter of the (d,m)-th entry of the local weight matrix(fF)iin the i-th node 
{^dm)i ■ ^ scalar which denotes the variance of of the (d,m)-th entry of the local weight matrix(fF)iin the i-th node 
'■ a scalar which denotes the Lagrange multiplyer corresponding to(m(^)i 
'■ a scalar which denotes the Lagrange multiplyer corresponding to(Aj^)i 
{ldm)i • ^ scalar which denotes the Lagrange multiplyer corresponding to(mJ^^)i 
{Pdm)i • ^ scalar which denotes the Lagrange multiplyer corresponding to(AJ^^)i 

Prior and Liklihood distributions are as follows: 

• P{Xn\Zn,W,^i,T)=J\f{Xn-,WZn+fJ,,T~^I), n=l,...,N 

• P{zn) = N'{zn;0,I), n = l,...,N 

• P{W\a) =ll^^-^^Af{wd;wd,diag{a~^)) 

where a = [ai, a 2 , cxm]- So, the joint distribution of the parameters and the observations are as 
follows: 

N DM 

P{X, Z, W, fxjr, a, 0) = n + fx, T-^imz„; 0, /) nn ^rn 

n—1 d—1 m—1 

D 

\{N{^id■,^id,6-^) (25) 

d^l 

The mean field Variational Distribution is as follows: 

• 9(VL, = {zi, Zn}) = qiZ)q(W)q{fd) 

• 9(m) = nfci-A/'(Ai;TOd,A;;) 

• = rifcl 11^=1 \dm) 

B.l Notations for Distributed BPCA 

Some Notations: 

• G = {V, E): Undirected connected graph with vertices in V and edges in E 

• i,jGV: Node index 

• Cij € E: Edge connecting node i and node j 
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• Bi = {j\eij € E: Set of neighbor nodes directly connected to i-th node 

• TViiThe number of samples collected in i-th node 

• {znU: n-th M X 1 dimensional latent variable at node i where n = 1,Ni 

• {xn)i- n-th D X 1 dimensional column vector at node i where n = 1, Ni 

• Zi — {Tl — 1, . 

• Ni — Tl — 1, 

• Oi = (A(^)i, {m^^)i, (A^^) J : Set of local parameters of the i-th 

node that need to be optimized through B-ADMM algorithm. 


B.2 Distributed Algorithm for BPCA 

In this section, we derive an iterative Variational Bayes (VB) algorithm for D-BPCA using B- 
ADMM: 


Qi = argmin - Eq.[log P{Xi, Z^, IVi, /r*, t*, a*, 9i)] - H[qi\ 

fti 

s.t. {m^d)i = {Pd)ij, {P%j = {m^d)h i&VJeBi 
s.t. (A^X = (</,^),,, = (A^)^., ieV,jeB, 

s-t- imdm)^ = iplm)ij, iPdmh = imdm)p i & V, j £ B^ 

{Klm)i = i4’dm)ijt i G V,j G B^ 

(26) 

where {(Pd)ij, (<f>d)ij> (Pdm)ij’ (^dm)iE } auxiliary variables. If we solve this local optimiza¬ 
tion problem, we also solve global optimization since global optimization is simply the summation 
of local ones given consensus constraints meet. It should be noted that we put the same hyper¬ 
parameters for all nodes. [log P(Ai, Zi,Wi, p,i,Ti,ai, 9i)] can be expanded as follows: 


li 

2 


Ni 


Ni 


Ni 


Ni 


Xl(a:n)r {Xu)i - 2 ((fP)*)((Zn)*) - 2j2ixu)I {{p)i) + {{W)i){{Zn)i) 


\ Air) 1 1 

n—1 n—1 ' n—1 n—1 

1 D ^^ Q- 

+ -G '^{{{Wd)i - {wd)iVdiag{ai,){{wd)i - (wd)i)) - — ^ loga„^ -f -^{{{p)i - {p.)i)^{{p)^ - {p)i)) 


D, ^ 1 


Ni 


D M 


m—1 

D 


-—logOi -f -^log(lA^l) -J 2 Y 1 log((^dm)0 


(27) 


n—1 


d—1 m—1 


d^l 


where (.) denotes the expectation operator respect to the approximate distribution q. H[qi] is com¬ 
puted as: 


H[qi] = H[qi{Zi)] + H[q,iWi)] + Riq^ipi)] 

Ni DM D 

= T.H<lAiZn)i]+Y.T.H,i [(w'dj)] 

n—1 d—lj—1 d—1 

^ Ni DM ^ D 

= “ 2 ^ log (l^nl) + XI XI log ((^*n)i) + 2 X log ((^dh) ( 28 ) 

n—1 d—1 m—1 d—1 
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So, the augmented Lagrangian of (l26] t can be expressed as bellow. It should be noted that we utilized 
the Eq. 10 of the main paper (using KL divergence) to drive the update equations. 


Ni 


Ni 


Ni 


Ni 




^=1 


'i—l 


n—1 


n—1 


m m X 1 1 _ 

+ (Ai)») +'^{{Zn)I {W)f {W)i{Zr,)i)j -logT, + - ^ (( ( 0 „),)^ ((z„)i)) - - ^ log (|A^|) 

n=l n—1 ' n—1 n—1 

1 D 0- 

+ ^ - {wd)Ndiag{ai){{wd)r - {wd)i)) - — ^ loga„ + -^{{{g)i - (m)*)^((m)* “ (m)*)) 


D, ^ 1 


Ni 


D M 


m—1 

D 


- — log6», + -^log(lA^l) -Xm log(Wm)i) 


d—1 m—1 


d^l 



ihdJ ijk }’ {i'yd)ijk}, {{PdrrJijk}, {{Pd)ijkh with k = 1,2 are the Lagrange multipliers, and 
p is a positive scalar. We solve the above optimization problem using B-ADMM. We cyclically 
minimize the above objective function (£(17^)) over its parameters, then follow a gradient ascent 
step over the Lagrange multipliers. For deriving the update equations of the parameters, we omit t 
temporarily for notational brevity. 


Update for (Z„)i 

To optimise the latent parameters (z„)i, we can directly optimise Equation ( l26l l. However, since our 
variational approximation are in the exponential family, and there is no ADMM constraints on these 
parameters, we can instead directly give the update Equations for {zn)i given all the rest. Hence, it 
is easy to show that and A^ can be updated as (we omit t notations): 

(A:),=/ + t,((1L)7(1L).) (30) 
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So, we have: 




(31) 


{K), = I + rW {Ml M, + Y.{K)i") (32) 

d^l 

{mDi = Ti{kD~'^Ml{{Xn)i - ml) (33) 

where 

-{m^)i- 

= : (34) 

-{mf,)i-_ 

Update for auxilary parameters {pDij, {(l^d)^’ i^d)i 3 

= argmin {{J^)^J 2 - {7dhi)(Pdh 

(pS)« 

+ v(^KL{{m^)^] {Xd)i\\{Pd)if^ ('/>d)y) + KL{{m^)j; {Xd)3\\{Pd)ij'^ 

= argmin “ 7y-i)Py- + V (m^)^ - {pd)^J?/‘^{<l>d)ij + log(^d)ij “ log((^d)* + J 

p^j L ^ \yd)iJ 

+ P i^d)] - {Plhf/‘^{<t>Dp + log(^d)y - log((^d)j + “ \ 

The above objective function is a quadratic function of {Pd)ij that implies a closed form update for 

{P%3- 

= o-^gmin ((/3^)^J2 - (/^d)*ji)((('d)ii 

+ p{KL{{mDp {X>Di\\{pDip {(l>Dij) + KL{{m^^)p (A^)^||(p^)y; {(l)%j)^ 

= argmin {jiP^ - +p\{m'l)i - (p^)»j)V2(<('d)y + log(</'d)y “ log(^d)* + “ T 

4‘ti J J Y 2(0^)y 2J 

+ P - (Pd)ii)V2(</>d)ii + log(0^)ii - log(^d)j + ^ 

We take the derivetive of the above objective function respect to {4'd)ij and set it to zero: 

{{I3d)ij2 — {Pd)ijl) + ?7 ~ ((^d)i ~ {Pd)p) /2(<)'d)ij + ~ 2 {(j)^)'^- 

+P ^ - {{md)j - (Pdh) /^{^d)l + 

By simplifing the above equation, we have: 

- {p:ihim)i+p[- (K)* - (p:;)y)'/2+( 0 :;)., - i(A:;).] = 0 

+ P - {{m'i)j - {p^dllf + W)i3 - (38) 

This is a quadratic function of {4>d)ij for which we can find an algebraic solution. It is easy to 
show that the form of update equations for {Pd^)ij and {4>dm)ij '^^e same as {Pd)ij {4’Dij 
respectively. Hence, we can find an algebraic solution for them. 
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Update for 


U',(t+l) _ 


{m^)l ' ' = argmin — 

(^d) 


Ni 


- 2{xdn)t{m^d)^ + 2 ^(TO^)*(m^)7(w^)i + 


+ ^J2 (i^d)mii^d)i - (Pdh))) 

j&B. 


ieSi 


((w^)i - (Pd)ii)V2(A^)j + log(A^)j - log(0;;)ij- + 


1 («* 1 


' 2(a;;), 2 


(39) 


Again, this is a quadratic function of {m^)i for which we can find an algebraic solution. 

Update for {rn^^)^ 


(Wd = argmin ^ 


Ni 


M 


‘2{xdn)i(jn^^)i + 2 ^ 'X ^d)ii^nm)i + E )i (-2^mn)z('^cZm)i )* 


n—1 


m' — l 


+ f - {Pdm)ij) + (7Zf)52 “ i^dm)j 

j^Bi 


{'ai'^m)i ~ (Pdm)ij) /‘^i^dm)ij + log('/’dm)*i ~ log(Adm)j + ^ ( iw V . o 


1 (a:^™)* 1 


- iPdmh) I^Wdm)ii + l0g(Cm)y “ log(A^^)j + - 


1 (AJ™), 1 


2 {rdJ^, 2. 


(40) 


Again, this is a quadratic function of {m^)i for which we can find an algebraic solution. 

Update for (A(^)i 


(Ad)7^’ = argmin + y(A^)i - log(A^), + ^ - Wh) + {I3‘"d)^}2{{4>d)^] - (Ad)^) 


e^ 


(K) 

+ 2??E 

jeBi 


jdBi 


- {pd)ij?/‘^{Kh + iog(A^)i - iog((/);;)ij + - 


1 {rd)^ 1 


2 (Ad)i 


(41) 


We take the derivetive of the above objective function respect to {X^)i and set it to zero: 

,-+ V 

2 2 


+ ^ - 7^^ + E ( “ (/5d)y2 

j&Bi A 


+ 2pE 

j&Bi 




1 


_1 i^dh 

iK)^ 2 (a;;)! 


= 0 


(42) 


By simplifing the above equation, we have: 

h 

2 


m +E i^dh^ - 


JdBi 


2r/E 

ieBi 


-iim^dh-ip:hr + i>^'ih-7;m 


(43) 


This is a quadratic function of (Ajj)i for which we can find an algebraic solution. 
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Update for 


Ni 


= argmin - log(A^„) 




n—1 


j&Bi ^ ^ 


+ 2r/E 

j&rSi 


iimdm)i - {Pdmhf l‘^{^dm)i + log(Ad„)i - log((/)^^)y + 


2 {KJ^ 2 . 


We take the derivetive of the above objective function respect to (A^^)i and set it to zero: 


I E«™). + + E 

r, = ^ A^-n- ^ 


+ 2pE 

ieBi 


jeBi 
2 In/\w ^2 




1 1 {rdm)i 

2(A^„)fJ 


= 0 


(44) 


(45) 


By simplifing the above equation, we have: 


Ni 


- (a::). + E ( 


n—1 


jeBi 


+ 2vE 

j&Bi 


- i«m)i - {Pdmhf + i>^din)i ” ni^PdJ 


= 0 


This is a quadratic function of (A^^)i for which we can find an algebraic solution. 

(7,71-7^ = (7,71-1 + - (P7ir'^), Vt €V,j€ 

iid)\T^ = (id^% + viiPdir^ - V* e u,J e 

(/37lu= (/37III + - (^7lE^). V* e V J e B. 

= (/37IS + - (A^E'^)- e u,J e 

(7,7)E'^ = (77.)111vi € u,j e b, 
(7,7)E'^ = (7,7)glV* € u,j e 6. 
= (/ 3 d 7 )gl + - (</>L^) 1 E^), V* ev,j€ s. 

(/37E'^ = (/ 37 S + viWir^ - (A7E'E V* e Vj e S. 


(46) 

(47) 

(48) 

(49) 

(50) 

(51) 

(52) 

(53) 

(54) 
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C Learning the hyper-parameters 

Instead of pre-specifying the local hyper-parameters ai, and di, ■we put local Gamma prior dis¬ 
tribution on them {9i ~ r(ae, bg), aim ~ r(aa, ba)TTi ~ r(aT, &r)) and try to learn them through 
our coordinate decsent parameter updates (■we use the Gamma distribution as the approximate pos¬ 
terior distribution of the hyper-parameters: q{aim) = g(0) = T{a'g., bg,), q{Ti) = 

r(a!^., b '^.)). Since there are no B-ADMM constraints on these parameters, their closed-form mean 
held updates can be easily obtained. Ho^wever, it should be noted that for updating each of the above 
update equations, in ■which the hyper-parameters appear, the hyper-paramerters should be replaced 

by their approximate posterior means. In other words, r^, aim, and 9i should be replaced by 
^ respectively. 

D Synthetic Data Result 

In this section, we show the empirical convergence properties of the D-BPCA. We generated 50 
dimensional 250 random samples from Af(5,0.8). We assigned 50 samples equally to each node in 
a 5-nodes network connected with ring topology to hnd a 5 dimensional subspace. Our convergence 
criterion is the relative change in objective of (6) in the main paper and we stop when it is smaller 
than 10“^. We initialized parameters with random values from a uniform distribution. All the results 
are averaged over 20 independent random initializations. 

Fig.|4^shows the convergence curve of D-BPCA for various p values. As can be seen, all p values 
lead to convergence within 10^ iterations. Furthermore, the value they converge to is equivalent to 
centralized solution meaning we can achive the same global solution using the distributed algorithm. 
Fig. [4^ shows convergence curve as a function of the number of nodes in a network. In all cases, 
D-BPCA successfully converged within 10^ iterations. Similar trends were observed with networks 
of more than 10 nodes. We also conducted experiments to test the effects of network topology on 
the parameter convergence. Fig.|4^depicts the result for three simple network types. In all cases we 
considered, D-BPCA reached near the stationary point within only 10 iterations. 

We consider two possibilities of missing values; the case when values missing at random (MAR) 
and the case when values missing not at random (MNAR). In [15], it has been shown that Bayesian 
formulations of PCA can deal with missing values more efficiently than probabilistic (non-Bayesian) 
PC A, particularly in the MAR setting. The same conclusion holds for D-BPCA. As shown in Fig.|4] 
D-BPCA has less reconstruction error than D-PPCA and can effectively reconstruct the original 
measurement comparable to its centralized counterpart under different amounts of missing values. 
This fact also holds for MNAR case although the error tends to be slightly larger than in the MAR 
case. 

E Further results for Caltech Dataset 

10% MAR and MNAR results (averaged over all 5 objects) are provided in the Table |4] Again, for 
both MAR and MNAR cases, the D-BPCA has better performance than D-PPCA. 
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Figure 4; Missing At Random experiment; Average root mean squared error of reconstructions 
based on PPCA, BPCA, D-PPCA, D-BPCA results. 
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Figure 5; Missing Not At Random experiment; Average root mean squared error of reconstructions 
based on PPCA, BPCA, D-PPCA, D-BPCA results. 


Table 4; Missing data Results of Caltech dataset (all results ran 20 independent initializations). 
Results provide variances over various missing value settings. Numbers are subspace angles between 
fully observable centralized SVD result versus D-PPCA / D-BPCA result. 


MAR MNAR 

D-PPCA Mean 4.0609 9.4920 
_Variance 1.2976 5.9624 

D-BPCA Mean 2.2012 7.2187 

Variance 1.3179 5.2853 
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